20 research outputs found

    CTC Variations Through New WFST Topologies

    This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition. Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with back-off transitions; (2) the "minimal-CTC", which only adds self-loops when used in WFST-composition; and (3) the "selfless-CTC" variants, which disallow self-loops for non-blank units. Compact-CTC allows for 1.5 times smaller WFST decoding graphs and reduces memory consumption by a factor of two when training CTC models with the LF-MMI objective, without hurting recognition accuracy. Minimal-CTC reduces graph size and memory consumption by factors of two and four, respectively, at the cost of a small accuracy drop. Using selfless-CTC can improve the accuracy for wide context window models. Comment: Submitted to Interspeech 2022, 5 pages, 2 figures, 7 tables

    You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

    Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the training data amount is relatively low, this approach can allow an end-to-end model to reach hybrid systems' quality. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with the semi-supervised learning technique. We also investigate the influence of vocoder usage on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms a semi-supervised setup for LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% for test-clean and 13.5% for test-other.

    Confidence-based Ensembles of End-to-End Speech Recognition Models

    The number of end-to-end speech recognition models grows every year. These models are often adapted to new domains or languages, resulting in a proliferation of expert systems that achieve great results on target data while generally showing inferior performance outside of their domain of expertise. We explore the combination of such experts via confidence-based ensembles: ensembles of models where only the output of the most-confident model is used. We assume that models' target data is not available except for a small validation set. We demonstrate the effectiveness of our approach with two applications. First, we show that a confidence-based ensemble of 5 monolingual models outperforms a system where model selection is performed via a dedicated language identification block. Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data. We validate all our results on multiple datasets and model architectures. Comment: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland
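The selection rule described in the abstract above (keep only the output of the most-confident model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Hypothesis` type, the `select_most_confident` function, and the assumption that each model already reports a single comparable confidence score are all hypothetical. The abstract does not say how confidences are estimated or calibrated, so that step is left out.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Hypothesis:
    """One model's transcript plus a confidence score (hypothetical type)."""
    text: str
    confidence: float  # assumed comparable across models; higher is more confident


def select_most_confident(hypotheses: Sequence[Hypothesis]) -> Tuple[int, Hypothesis]:
    """Return the index and hypothesis of the most confident model.

    Ties are resolved in favour of the earliest model in the sequence.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    best_index = max(range(len(hypotheses)), key=lambda i: hypotheses[i].confidence)
    return best_index, hypotheses[best_index]


if __name__ == "__main__":
    # Made-up scores for three monolingual experts decoding the same utterance.
    outputs = [
        Hypothesis("hola mundo", 0.41),
        Hypothesis("hello world", 0.93),
        Hypothesis("hallo welt", 0.27),
    ]
    index, best = select_most_confident(outputs)
    print(index, best.text)
```

In practice the raw scores of different models may not be directly comparable, which is presumably where the small validation set mentioned in the abstract comes in; that calibration step is outside this sketch.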

    Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition

    With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to run directly on a device has become crucial. Researchers and industry prefer to use end-to-end ASR systems for on-device speech recognition tasks, because end-to-end systems can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Another challenging task associated with speech assistants is personalization, which mainly lies in handling out-of-vocabulary (OOV) words. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. To address the aforementioned problems, we propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique. It non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative WER and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is close to the best published multilingual system. Comment: 16 pages, 7 figures
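The non-deterministic tokenization that the abstract above attributes to BPE-dropout can be illustrated with a toy encoder. This is a sketch under stated assumptions, not the paper's pipeline: the function name `bpe_dropout_encode`, the tiny merge table, and the dropout values are made up for illustration. At every step, each applicable merge is skipped with probability `dropout`, so the same word can be split into different subword sequences across training epochs.

```python
import random
from typing import Dict, List, Tuple


def bpe_dropout_encode(
    word: str,
    merges: Dict[Tuple[str, str], int],
    dropout: float,
    rng: random.Random,
) -> List[str]:
    """Segment `word` with BPE merges, randomly dropping merges.

    `merges` maps a symbol pair to its rank (lower rank = higher priority).
    With dropout=0.0 this is plain greedy BPE; with dropout=1.0 every merge
    is dropped and the word stays split into characters.
    """
    symbols = list(word)
    while len(symbols) > 1:
        # Collect the merges applicable to adjacent pairs; each one survives
        # this step with probability 1 - dropout.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the surviving merge with the best (lowest) rank, leftmost first.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    toy_merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}  # hypothetical table
    rng = random.Random(0)
    print(bpe_dropout_encode("lower", toy_merges, dropout=0.0, rng=rng))
    for _ in range(3):
        print(bpe_dropout_encode("lower", toy_merges, dropout=0.3, rng=rng))
```

Whatever the dropout value, concatenating the returned pieces gives back the original word, so only the segmentation changes and never the transcript.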

    The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

    We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system mainly comprises the following modules: the Speaker Diarization Module, the Multi-channel Audio Front-End Processing Module, and the ASR Module. These components together form a cascaded system that processes multi-channel, multi-speaker audio input. Moreover, this paper highlights the comprehensive optimization process that significantly enhanced our system's performance. Our team's submission is largely based on NeMo toolkits and will be publicly available.

    Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

    Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker on each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER). Comment: Accepted to Interspeech 202

    Structural Defects in TiNi-Based Alloys after Warm ECAP

    The microstructure, martensitic transformations and crystal structure defects in the Ti50Ni47.3Fe2.7 (at%) alloy after equal-channel angular pressing (ECAP, angle 90°, route BC, 1–3 passes at T = 723 K) have been investigated. A homogeneous submicrocrystalline (SMC) structure (grains/subgrains about 300 nm) is observed after 3 ECAP passes. Crystal structure defects in the Ti49.4Ni50.6 (at%) alloy (8 ECAP passes, angle 120°, BC route, T = 723 K, grains/subgrains about 300 nm) and the Ti50Ni47.3Fe2.7 (at%) alloy with SMC B2 structures after ECAP were studied by positron lifetime spectroscopy at room temperature. A single component with positron lifetime t1 = 132 ps and t1 = 140 ps was observed in the positron lifetime spectra (PLS) of the annealed coarse-grained ternary and binary alloys, respectively. These t1 values correspond to the lifetime of delocalized positrons in the defect-free B2 phase. Two-component PLS were found for all samples subjected to ECAP. The component with t2 = 160 ps (annihilation of positrons trapped by dislocations) is observed for all samples after 1–8 ECAP passes. The component with t3 = 305 ps (annihilation of positrons trapped by vacancy nanoclusters) was detected only after the first ECAP pass. The component with t3 = 200 ps (annihilation of positrons trapped by vacancies in the Ti sublattice of the B2 structure) is observed for all samples after 3–8 ECAP passes.

    Crystal Structure Defects in Titanium Nickelide after Abc Pressing at Lowered Temperature

    The experimental results regarding the effect of warm (573 K) abc pressing with an increase in the specified true strain, e, up to 9.55, on the microstructure and crystal structure defects (dislocations, vacancies) of the Ti49.8Ni50.2 (at %) alloy are presented. It is shown that all samples (regardless of e) have a two-level microstructure: grains/subgrains of the submicrocrystalline scale lie within the volumes of large grains. The average sizes of both the large grains and the grains/subgrains decrease as e increases to 9.55 (from 27 to 12 µm and from 0.36 to 0.13 µm, respectively). All samples had a two-phase state (rhombohedral R and monoclinic B19′ martensitic phases) at 295 K. The full-profile analysis of X-ray reflections of the B2 phase obtained at 393 K shows that the dislocation density increases from 10¹⁴ m⁻² to 10¹⁵ m⁻² after pressing with e = 1.84 and reaches 2·10¹⁵ m⁻² when e increases to 9.55. It has been established by positron annihilation lifetime spectroscopy that dislocations are the main type of defects in the initial samples and the only type of defects in samples after abc pressing. The lifetime of positrons trapped by dislocations is 166 ps, and the intensity of this component increases from 83% in the initial samples to 99.4% after pressing with e = 9.55. The initial samples contain a component with a positron lifetime of 192 ps (intensity 16.4%), which corresponds to the presence of monovacancies in the nickel sublattice of the B2 phase (concentration ≈10⁻⁵). This component is absent from the positron lifetime spectra of the samples after pressing. The results of the Doppler broadening spectroscopy analysis correlate with the data obtained by positron annihilation lifetime spectroscopy.